Abstract. The aim of this work is to define and implement an extended
C++ language to support the SIMD programming paradigm. The C++
programming language has been extended to express the full potential
of an abstract SIMD machine consisting of a central Control Processor
and an N-dimensional toroidal array of Numeric Processors. Very few
extensions have been added to standard C++, with the goals of minimising
the effort required of the programmer to learn a new language and of
keeping the performance of the compiled code very high. The proposed
language has been implemented as a port of the GNU C++ Compiler
to a SIMD supercomputer.



1 Introduction 

The aim of this work is to define and implement an extended C++ language
to support the SIMD programming paradigm. Our goal is to add minimal
extensions to the standard C++ language in order to minimise the syntactical
differences when porting standard C++ applications or writing new codes.
Our decision to stay as close as possible to the standard led to the
definition of an extended C++ language that has very few constructs for C++
programmers to learn and is relatively easy to use.

Using our language, the SIMD parallelism is efficiently achieved with tradi- 
tional sequential programming plus a couple of new constructs (used to perform 
memory mapped internode communication and to inhibit execution of code on 
some processing nodes) and some knowledge of the native data types and their 
allocation. The programmer can thus focus on the realization of the algorithm
and on the data distribution, which are the key points in exploiting the parallel
architecture.

The proposed language has been implemented as a port of the GNU C++
Compiler (release 2.95.1) to the APEmille parallel supercomputer. Some
modifications of the GNU C++ Compiler have been introduced, as well as the
complete redefinition of the back-end for the target machine. APEmille is a
parallel SIMD computer developed at INFN (Italian National Institute for
Nuclear Physics), capable of a peak performance of 1 Teraflop in a
configuration with 2048 processing nodes.

The simplicity and small number of extensions to the standard language helped
us reach the goal of efficient executable parallel code, the main goal for any
number-crunching application running on a massively parallel supercomputer.

In this paper, we describe the proposed SIMD C++ language, and especially
those aspects which extend the standard C++ syntax or semantics. Section 3 is
devoted to this description. Section 2 explains the abstract SIMD architecture
to which we refer, while Section 4 reports on work related to ours, either
for the language used (extensions of C/C++) or for a similar target architecture
or parallel paradigm. Section 5 contains the conclusions.

2 The Abstract SIMD machine 

SIMD machines consist of synchronized processing elements with a single
associated control processor. The Control Processor (later, CP) broadcasts the same
instruction stream to all processing elements. All processing elements execute the
same instruction at each clock cycle on their own data. In the proposed
architecture the processing elements are specialized processors for numeric applications:
we will call them Numeric Processors or NPs. The NPs form an N-dimensional
toroidal array. Each NP consists of an ALU, its own register file, a local
memory and local mass storage. There is no shared memory across the
whole machine: communication among NPs is achieved through a memory mapping
mechanism that allows each NP to access the memory of its neighbours. The
machine is SIMD and guarantees no conflicts in memory accesses.

2.1 Control Processor 

The CP handles integer data types, executes branches and function calls, and
generates memory addresses. Every instruction is sent by the CP to each NP in the
machine at the same clock cycle.

Global addresses are broadcast to all NPs, which use them to address
their own memory.

2.2 Numeric Processor 

NPs are specialized in numeric instructions: they natively support the floating
point data types, namely scalar (both single and double precision) and
vector/complex (a pair of single precision values), as well as the integer data type.

NPs, in order to perform conditional execution, can test local conditions, 
and, when they are not met, can disable the effect of the following numeric 
instructions. We call the conditional test a where instruction. 

The CP can also make all NPs test their local conditions. It can then perform a
global branch if any, all or none of the NPs have met the condition.
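The global test described above can be pictured as a reduction over the per-NP condition flags. The following is a minimal sketch in standard C++ that emulates it on a sequential host; the names GlobalTest and global_condition are hypothetical and are not part of the proposed language or of the target machine.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical emulation of the CP's global test over all NPs.
enum class GlobalTest { Any, All, None };

// flags[n] holds the result of the local condition tested on NP n.
bool global_condition(const std::vector<bool>& flags, GlobalTest mode) {
    assert(!flags.empty());
    auto is_true = [](bool f) { return f; };
    switch (mode) {
        case GlobalTest::Any:
            return std::any_of(flags.begin(), flags.end(), is_true);
        case GlobalTest::All:
            return std::all_of(flags.begin(), flags.end(), is_true);
        case GlobalTest::None:
            return std::none_of(flags.begin(), flags.end(), is_true);
    }
    return false;  // not reached; keeps compilers quiet
}
```

In this picture the CP would take the branch only when global_condition returns true, so all NPs still follow a single instruction stream.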



NPs can address their own memory using the global address generated by
the CP, optionally adding a local offset.

3 Proposed SIMD C++ language 

In this section we describe our extended C++, analyzing all the aspects most closely
related to parallelism. Subsection 3.9 contains a summary of the main characteristics
and discusses general topics.

3.1 Types, Declarations and Allocation 

Basic Data Types. By basic data types or basic types we mean the types
natively supported by the language, such as int or float. The C++ language
proposed in this paper includes all the basic data types supported by the
abstract SIMD machine illustrated above, namely:

int, float, double, complex, vector, localint 

The localint data type denotes integer variables allocated in the numeric processors
(see later), while the types vector and complex each represent a pair of floats and are
treated as native types by our abstract machine. Pointers, arrays and function
pointers are supported for every type and every level of indirection.

These types are all signed. All the other "standard" C++ types (e.g. long long,
long double, char, and all unsigned types) could be supported by performing
software emulation for those not supported by the physical machine.

Declaring variables is absolutely identical to standard C++. It is also pos- 
sible to declare new types with the typedef keyword just as in C++, with the 
standard rules and no limitation. 

Allocation. Declared variables are allocated in the Control Processor (CP)
or in the Numeric Processors (NPs) depending on their type. We divide the
basic types into two groups: the Control Processor types (int, pointers) and the
Numeric Processor types (float, double, vector, complex, localint).

CP variables are allocated in the (unique) Control Processor, so there is just 
one instance of them; NP variables are allocated in each NP data memory, and, 
most important, at the same location in every one: so there are multiple instances 
of numeric variables. This is the essence of the SIMD programming paradigm, 
and implies a very important fact: memory images of NPs are all identical; each 
allocation, both static and dynamic, is the same for every NP. 
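To make the allocation model concrete, the sketch below models the data memories on a sequential host in standard C++. It is only an illustration under stated assumptions: the type SimdMemory and its member store_np are hypothetical names, and the real machine does not expose its memories in this way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the machine's data memories.
struct SimdMemory {
    std::vector<int> cp;                   // the single CP data memory
    std::vector<std::vector<double>> np;   // one data memory per NP

    SimdMemory(std::size_t num_nps, std::size_t cp_words, std::size_t np_words)
        : cp(cp_words, 0), np(num_nps, std::vector<double>(np_words, 0.0)) {}

    // Store to an NP variable: the same address is written in every NP,
    // but each NP may hold a different value there.
    void store_np(std::size_t address, const std::vector<double>& per_np_value) {
        assert(per_np_value.size() == np.size());
        for (std::size_t n = 0; n < np.size(); ++n) {
            assert(address < np[n].size());
            np[n][address] = per_np_value[n];
        }
    }
};
```

A CP variable corresponds to one slot of cp, whereas an NP variable corresponds to the same slot index in every element of np, which is why the memory images of all NPs have an identical layout.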

Arrays of any type are allocated in the same processor as their base type. For
example, the array

    double a[100000];

is allocated in the NPs' memories. However, the base of the array, which is known
at compile time, is a pointer, and so it is handled by the CP.

The allocation mechanism is automatic and controlled by the compiler: there
is no way and no reason for the programmer to alter it. Data distribution, on
the other hand, is left to the programmer. We will discuss this topic in subsection 3.8.




3.2 Expressions 



Expressions within the same type. Handling expressions among variables
of the same type is unambiguous: they are allocated in the same kind
of processor, so code for that processor is generated to handle them. For
example:



    int i, j, k;        // CP allocation
    double a, b, c;     // NP allocation
    i = j+1;            // CP code
    a = 1.0;            // NP code
    b = a*c-b;          // NP code
    k++;                // CP code
Table 1. Allowed Promotions



Mixed-type expressions. These are handled by type promotion or by explicit
casts written by the programmer. Specific rules govern casts and promotions:

- casts/promotions from CP to NP types are ALWAYS allowed;
- casts/promotions from NP to CP types are NEVER allowed;
- casts/promotions between two types of the same group are allowed depending
  on the specific types.
Table 2. Allowed Casts



A cast/promotion from a CP type to an NP type necessarily generates multiple
instances of one value.
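The replication implied by a CP-to-NP promotion can be sketched in standard C++ as a broadcast of one value into one slot per NP. This is a host-side emulation only; the function name promote_to_np is hypothetical and does not appear in the proposed language.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation: promoting a CP int to an NP double replicates the
// single CP value into one instance per NP.
std::vector<double> promote_to_np(int cp_value, std::size_t num_nps) {
    assert(num_nps > 0);
    return std::vector<double>(num_nps, static_cast<double>(cp_value));
}
```

For example, promote_to_np(7, 3) yields three instances, each holding 7.0.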

3.3 Multiple Addressing 

As stated before, the abstract SIMD machine can add a local offset when
accessing local NP memory, so every NP can access a different location
in memory. This is realized with localint variables, which are integers
allocated in the NPs. Their values can be used to add a local displacement when
accessing local memory.

A pseudo-function localoffset() can be called with a localint argument
to set the local offset for the following memory access.

    int i;
    localint li;
    float r, a[100];
    // ...
    localoffset(li);    // set the per-NP offset for the next memory access
    r = a[i];

In the example above, access in array a is at index i + li. 
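The effect of the local offset can be emulated in standard C++ by looping over the NPs explicitly. The sketch below is an assumption-laden illustration: load_with_local_offset is a hypothetical name, and the outer vector index plays the role of the NP number.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of "localoffset(li); r = a[i];".
// a[n] is the instance of array a on NP n; li[n] is that NP's local offset.
std::vector<float> load_with_local_offset(
        const std::vector<std::vector<float>>& a,
        int i,                          // global index generated by the CP
        const std::vector<int>& li) {   // one localint value per NP
    assert(a.size() == li.size());
    std::vector<float> r(a.size());
    for (std::size_t n = 0; n < a.size(); ++n) {
        const int index = i + li[n];    // effective index on NP n
        assert(index >= 0 && static_cast<std::size_t>(index) < a[n].size());
        r[n] = a[n][static_cast<std::size_t>(index)];
    }
    return r;
}
```

With i = 1 and local offsets 0 and 2 on two NPs, the first NP reads its a[1] and the second reads its a[3], although both receive the same global address from the CP.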

3.4 Type Constructors: Structs, Classes and Unions 

It is possible to declare a new type using struct, class or union as in standard
C++. Structs and classes can contain data fields of any other data type (both
CP types and NP types), while unions must contain fields associated with the same
kind of processor (only CP or only NP), for the allocation reasons explained
before.

    class Mixed {
        int a;      // CP field
        float x;    // NP field
    public:
        Mixed(int aa, float xx) : a(aa), x(xx) {}
    };

Each field is allocated in its respective processor, so that multiple instances
of each numeric field exist. The effort of addressing them and keeping pointers
consistent is made by the compiler. Fields must be accessed directly rather than by
pointer arithmetic: incrementing and decrementing pointers to "navigate" through a
struct or class could generate unpredictable results, because the object is allocated
in different memories. The space allocated is compacted, so that only the necessary
size is allocated in each kind of processor.
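The split layout of such an object can be pictured with a host-side model in standard C++. The names MixedCpPart, MixedNpPart and MixedImage are hypothetical, and the constructor simplifies matters by giving every NP the same initial x; on the real machine each NP could hold a different value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side picture of how an object of class Mixed is split.
struct MixedCpPart { int a; };     // CP field: one instance
struct MixedNpPart { float x; };   // NP field: one instance per NP

struct MixedImage {
    MixedCpPart cp;
    std::vector<MixedNpPart> np;

    MixedImage(int aa, float xx, std::size_t num_nps)
        : cp{aa}, np(num_nps, MixedNpPart{xx}) {
        assert(num_nps > 0);
    }
};
```

Because the two parts live in separate memories, stepping a raw pointer from a to x has no meaningful interpretation in this picture, which matches the restriction on pointer navigation stated above.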



3.5 Object Oriented Features: Encapsulation, Inheritance, Polymorphism

Encapsulation is handled as in standard C++, with no extensions or limitations.
Field allocation follows what is stated in Section 3.4. Methods are called by passing
them the invocation object as a hidden argument. The same method is executed
by each NP.

Inheritance and polymorphism likewise have neither extensions nor limitations.
Non-virtual base class members are inserted in the CP or NP instance layout of the
object according to the class of their type. Virtual base class members and virtual
class information are inserted into the CP instance of the object, since they are
pointers.



3.6 Communication 

Communication among Numeric Processors is achieved through memory mapping.
The proposed C++ language allows an array element in a remote NP to be
addressed by adding a constant to the array index or to the pointer that would be
used for local access. Different pre-defined constants are associated with neighbour
NPs. These constants specify the position of the NP to be accessed relative
to the current NP. The constants are generated and handled on the CP,
so they are the same for all NPs. The following example shows communication
between NPs:

    float r, v[100000];

    r = v[3+XPLUS_NP];  // each NP reads element 3 of v from its
                        // nearest neighbour on the x axis
                        // in the positive direction

The constant XPLUS_NP is machine dependent. 
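A neighbour read of this kind can be emulated in standard C++ on a one-dimensional ring of NPs. The sketch assumes, purely for illustration, that the positive-x neighbour of NP n is NP (n + 1) modulo the number of NPs; the function name read_from_xplus is hypothetical, and the actual encoding of XPLUS_NP is machine dependent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of "r = v[index + XPLUS_NP];" on a 1-D ring of NPs.
// v[n] is the instance of array v on NP n.
std::vector<float> read_from_xplus(const std::vector<std::vector<float>>& v,
                                   std::size_t index) {
    const std::size_t num_nps = v.size();
    assert(num_nps > 0);
    std::vector<float> r(num_nps);
    for (std::size_t n = 0; n < num_nps; ++n) {
        const std::size_t neighbour = (n + 1) % num_nps;  // toroidal wrap-around
        assert(index < v[neighbour].size());
        r[n] = v[neighbour][index];
    }
    return r;
}
```

The wrap-around in the neighbour computation reflects the toroidal topology: the last NP on the axis reads from the first.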

It is possible to use a remote object as a parameter or as the invocation object
of a method. In this way, code like:

    class C {
    public:
        float x;
        void f(float y) { x = y; }
    };

    int main()
    {
        float a;
        C v[10];
        // ...
        v[0+XPLUS_NP].f(a);
        // ...
    }

assigns to the x field of the v[0] object on each node the value of a on the
adjacent (XMINUS_NP) node.

3.7 Local Conditions 

Since the instruction flow is unique, it is possible to branch only when global
conditions (conditions on the CP) are met. Conditions on local variables (on
NPs) can be handled with the where-elsewhere keywords. Conditioned code
is executed only by those NPs that meet the condition; all the other NPs
execute NOPs. CP instructions inside a where block, on the other hand, are
always executed. where-elsewhere are used just like if-else, as shown in the
following example.

    int i;
    double x, y;
    // ...
    where (x != 0.0) {
        y = 1/x;
    }
    elsewhere {
        y = 0;
    }
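The masking behaviour of where-elsewhere can be emulated in standard C++ with an explicit per-NP mask. This is a sketch under assumptions: where_reciprocal is a hypothetical name, the vectors stand for the per-NP instances of x and y, and the sequential loops stand in for the lock-step execution of the NPs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of:
//   where (x != 0.0) { y = 1/x; } elsewhere { y = 0; }
// x[n] and y[n] are the instances of x and y on NP n.
void where_reciprocal(const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<bool> mask(x.size());
    for (std::size_t n = 0; n < x.size(); ++n)
        mask[n] = (x[n] != 0.0);          // local test on each NP

    // "where" block: every NP steps through it, only enabled NPs store.
    for (std::size_t n = 0; n < x.size(); ++n)
        if (mask[n]) y[n] = 1.0 / x[n];

    // "elsewhere" block: the mask is inverted.
    for (std::size_t n = 0; n < x.size(); ++n)
        if (!mask[n]) y[n] = 0.0;
}
```

Both blocks are traversed by all NPs, so the cost is that of the two branches combined; this is the usual price of conditional execution on a single instruction stream.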



3.8 Examples of implementation of SIMD programs 

When writing a program for a SIMD machine using the proposed C++ extensions,
the "SI" part of the SIMD paradigm is realized by the single instruction
stream, as the C++ language naturally provides; the "MD" part of SIMD is achieved
by allocating multiple instances of the numeric variables.



(Footnote to Section 3.7: in our implementation, where and elsewhere are not keywords but function names.)



The initial loading of different data into each Numeric Processor's data memory
is performed by the operating system, while the slicing of a big array across the NPs
must be determined by the programmer.

For example, suppose that the problem needs to handle an array of 10000 x
10000 elements, and that the target machine has a 2D square topology with
10 x 10 = 100 Numeric Processors. The programmer will declare a 1000 x 1000
array (every NP will have its own instance of this array, which represents a
slice of the big array). A very simple example of code that implements the sum
of two 10000 x 10000 element arrays is useful to explain the parallelization
mechanism and the data distribution:

    int main()
    {
        const int dimx = 1000;
        const int dimy = 1000;
        const int size_per_node = dimx*dimy;
        float m1[dimx][dimy], m2[dimx][dimy], m3[dimx][dimy];
        const char *filename = "myfile.data";
        // ...
        distributed_load(m1, filename, size_per_node);
        distributed_load(m2, filename, size_per_node);
        for (int i=0; i<dimx; i++)
            for (int j=0; j<dimy; j++)
                m3[i][j] = m1[i][j] + m2[i][j];   // executed by every NP on its slice
        distributed_store(m3, filename, size_per_node);
        // ...
    }

Numeric instructions will be executed in parallel by the Numeric Processors
on their local data, while the Control Processor will execute flow control and integer
instructions and will generate memory addresses. The distributed_load() and
distributed_store() functions perform machine dependent system calls sup- 
posed to load and store data in the appropriate way. 
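The slicing chosen by the programmer amounts to an index mapping from the big array to an NP and a position inside that NP's slice. The sketch below shows one such mapping in standard C++, assuming a contiguous block layout; the names Location and locate are hypothetical, and the layout actually used by distributed_load() and distributed_store() is not specified here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block-distribution helper for a 2-D grid of NPs.
struct Location {
    std::size_t np_x, np_y;   // coordinates of the owning NP in the grid
    std::size_t i, j;         // indices inside that NP's local slice
};

// Maps global element (gi, gj) to its NP and local indices, assuming each NP
// holds a contiguous dimx x dimy slice of the big array.
Location locate(std::size_t gi, std::size_t gj,
                std::size_t dimx, std::size_t dimy,
                std::size_t nps_x, std::size_t nps_y) {
    assert(dimx > 0 && dimy > 0);
    assert(gi < dimx * nps_x && gj < dimy * nps_y);
    return Location{gi / dimx, gj / dimy, gi % dimx, gj % dimy};
}
```

With the figures of the example (1000 x 1000 slices on a 10 x 10 grid), global element (2500, 9999) would belong to the NP at grid position (2, 9), at local indices (500, 999).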

3.9 General remarks on the language proposed 

The proposed language is very similar to standard C++; it is easy to learn, use
and debug, produces highly efficient executable code, and can be used in a
professional environment.

The most important aspect of the proposed language is that it is a minimal 
extension to the C++ standard. This is a key feature as we want the programmer 
to concentrate on the application development rather than paying attention to 
implementation aspects. 

Our language strictly conforms to the architectural characteristics of the machine
in order to fully exploit the simplification that the SIMD synchronous structure
and the memory mapped internode communications introduce in the task of
writing the parallel algorithm.

As a result, there is no need to develop multi-threaded programs or to use
any special communication library.

Finally, object distribution is obtained through the simple allocation model
described above. All objects are replicated on each processing element, and the
invocation of a method is executed on all NPs (or on the subset of them satisfying a
WHERE condition, if any).

4 Related work 

In this section we compare our language to a couple of parallel C/C++ extensions
and to High Performance Fortran, which is the standard for data parallel
applications.

HPC++ is a set of class libraries and tools that extend the C++ language.
It also has a set of runtime systems that are required for remote access. It
refers to a very general architectural model, so it can be used on a variety of
machine architectures. There are two main execution modes for programs written
in HPC++:

1. multi-thread shared memory mode, suitable for coarse-grained applications
with some particular collective operations for thread synchronization;

2. Single Program Multiple Data (SPMD) mode, in which n copies of the same
program run on the n processing nodes. This mode is similar to using C/C++
with MPI or PVM. The programmer must manage the data distribution and
the synchronization of processes.

HPC++ thus takes a totally different approach from ours, and this
approach is not applicable to our language architecture, which is focused on
single-threaded programs.

pC++ is a C++ extension that provides a thread-based programming
model and a simple way to encapsulate SPMD code in it, together with a
mechanism for data distribution similar to the one adopted in HPF (see below). The
key concept of this extension is the collection of objects. It is possible to invoke
a method on an entire collection or on a part of it. The compilation of pC++
code is achieved as a translation into standard C++ by a preprocessor.

The HPF programming model has the following key points:

1. single threaded control; 

2. global namespace, with the low-level details of data distribution and remote
communication hidden from the programmer;

3. loosely synchronous: synchronization of program execution on different nodes 
is accomplished only at special points (e.g. the completion of a loop) and 
not instruction by instruction; 

4. parallel operations: operations on array elements executed at the same time 
over all nodes. 



HPF extends the Fortran language by adding compiler directives, libraries and
new language constructs.

The compiler directives most relevant to our purposes are those related to
data distribution. This is accomplished in three steps: in the first step an
alignment is defined for arrays, in the second step the aligned data are mapped onto an
abstract set of processors, and finally this set is mapped onto physical
processors. This is quite different from our approach, as we require the programmer to
take care of the allocation of large matrices as described above. In particular, we
rely on operating system calls to perform something similar to the BLOCK and
DEGENERATE distribution types.

A similarity between HPF and our language is the specification of locally con- 
ditioned code execution via the WHERE statement, although the HPF version 
accepts logical-array arguments while ours accepts any logical condition between 
NP data. Other parallel constructs, such as FORALL, are not provided by our 
language. 

Of the three, HPF is the parallel extension to a standard language that
best matches our approach.

5 Conclusions 

The next topics to be analyzed will include exception handling, RTTI (Run Time
Type Information) and namespaces.

Our implementation of the compiler, based on a port of the GNU CC
compiler to the APEmille architecture, is currently under test. We plan to discuss
our implementation in a further paper.